Abstract. We investigate the cycles in the transcription network of
Saccharomyces cerevisiae. Unlike a similar network of Escherichia coli, it
contains many cycles. We characterize properties of these cycles and 
their place in the regulatory mechanism of the cell. 

Almost all cycles in the transcription network of Saccharomyces cere- 
visiae are contained in a single strongly connected component, which we 
call LSCC (L for "largest"), except for a single cycle of two transcription 
factors. 

Among different physiological conditions, cell cycle has the most signif- 
icant relationship with Lscc, as the set of 64 transcription interactions 
that are active in all phases of the cell cycle has overlap of 27 with the 
interactions of Lscc (of which there are 49). 

Conversely, if we remove the interactions that are active in all phases of 
the cell cycle (fewer than 1% of the total), the Lscc would have only 
three nodes and 5 edges, 4 of which are active only in the stress response 
subnetwork. 

Lscc has a special place in the topology of the network and it can be 
used to define a natural hierarchy in the network; in every physiological 
subnetwork Lscc plays a pivotal role. 

Apart from those well-defined conditions, the transcription network of
Saccharomyces cerevisiae is devoid of cycles. We observed that the two
studied conditions that have no cycles of their own, diauxic shift and DNA
repair, are exogenous, while cell cycle and sporulation are endogenous.
Perhaps, during the slow recovery phase, the stress response is endogenous
as well.



1 Background 



Cycles have a central role in control of continuing processes (for an example, see 
Hartwell [3]). Therefore we expect the regulatory mechanism of a cell to have 
many cycles of interactions. Only some of these interactions have the form of a 
transcription factor (TF for short) regulating expression of a target gene. Our 
question is: given that there are cycles of transcription interactions, are they 
important in the regulation of life processes? 

Graph properties of the regulatory networks have been reported in a number
of papers. Shen-Orr et al. [8] analyzed the regulatory networks statically and
observed certain characteristic motifs that are more frequent than in the
random model and which have functional significance (while other small
subgraphs are significantly less frequent). Luscombe et al. [4] studied the
dynamics of the regulatory network of Saccharomyces cerevisiae as it changes
for multiple conditions and proposed a method for the statistical analysis of
network dynamics. They found large changes in the topology of the network and
compared it with random graphs.

We have found that the transcription network of Saccharomyces cerevisiae 
contains a single large strongly connected component (a union of overlapping cy- 
cles), which we call LSCC, and that the topology changes discussed by Luscombe 
et al. [4] are well reflected within Lscc, in spite of its small size. 

Yu and Gerstein [12] have examined the structure of regulatory networks
and showed that it exhibits a certain natural hierarchy. We propose another
hierarchical partition of the network: above the Lscc, the Lscc, below the Lscc
and "parallel" to the Lscc (see Table 1), and we show that this partition is in
some sense natural.

Comparisons of biological networks with random graphs were the subject of
methodological investigations of Barabasi and Albert [1], who proposed a
scale-free model. This model is rather difficult to apply in the case of
directed graphs that have large asymmetry between edge beginnings and ends; one
can have a separate model for the out-degrees (a power law) and for the
in-degrees (a Poisson distribution), but parameters of such graphs converge very
slowly, so a model based on such parameters can be misleading. Therefore Milo
et al. [6] (see also Newman et al. [7]) proposed several methods of generating
graphs that have the same in- and out-degrees as the reference network. We used
the faster and somewhat biased variant of their "matching algorithm".

2 Results and Discussion 

In the data set of Luscombe et al. [4] we can see the Lscc with 25 TFs and one 
small strongly connected component with two TFs. 

To see if the cycles of the Lscc are significant, we checked how the topological 
changes of the transcription network during various physiological conditions are 
reflected inside the Lscc, we checked several graph characteristics of the TFs 
in the Lscc, and we compared the characteristics of the Lscc to the cycles in 
random networks. 

2.1 General characterization of the cycles 

Size of Lscc in the expected range. The cycles form two strongly connected
components, one "small", consisting of 2 TFs, and one "large", consisting of
25 TFs.

The degenerate component consists of two TFs with indistinguishable inter- 
actions that have self-loops, thus they are TFs of themselves, and of each other. 
This may be a result of a relatively recent gene duplication. Thus we will ignore 
this cycle in our discussions. 



The size of the cyclic component is within the range of variability for random
models, and this range, 17-40 (with the average of 30) does not change much
when we boost the number of elementary motifs. Thus the size alone would allow
the cyclic component to be a random artifact of other properties of the network.

It is also typical that there are very few cycles outside the largest strongly 
connected component: the average sum of sizes of other non-trivial strongly 
connected components is 1.4. 

By way of contrast, the transcription network of Escherichia coli is either
devoid of cycles or it contains very few of them (depending on the data set, see
Cosentino Lagomarsino et al. [2]).

Lscc connected very strongly to the cell cycle. The transcription network
reported by Luscombe et al. [4] has 142 TFs and 7074 interactions, of which
we disregard 21 "self-loop" interactions. 25 TFs and 49 interactions form the
Lscc (as they cannot be contained in longer simple cycles). The subnetworks
associated with the 5 stages of the cell cycle have 64 interactions in common
(we name this set Ccc, "common to cell cycle"), of which 27 are present in the
Lscc. If even one of these two sets of interactions were random, the expected
number of common elements would be smaller than 1 (49 × 64/7053 ≈ 0.44).
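The arithmetic behind this estimate can be checked with a short sketch. The expectation formula is the one quoted above; the hypergeometric tail probability is an added illustration and is not a test reported in the paper. The function names are hypothetical.

```python
from math import comb

def expected_overlap(size_a, size_b, universe):
    """Expected intersection size when one of the two sets is drawn uniformly at random."""
    return size_a * size_b / universe

def tail_probability(size_a, size_b, universe, observed):
    """Hypergeometric P(overlap >= observed): a random size_a subset against a fixed size_b subset."""
    total = comb(universe, size_a)
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_b, k) * comb(universe - size_b, size_a - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Numbers taken from the text: 49 Lscc interactions, 64 Ccc interactions,
# 7053 interactions overall (self-loops removed), 27 interactions shared.
lscc_edges, ccc_edges, all_edges, shared = 49, 64, 7053, 27
expected = expected_overlap(lscc_edges, ccc_edges, all_edges)
p_value = tail_probability(lscc_edges, ccc_edges, all_edges, shared)
print(f"expected overlap: {expected:.3f}")
print(f"P(overlap >= {shared}): {p_value:.3e}")
```

Under this simple null model the observed overlap of 27 is far outside what chance would produce, which is consistent with the claim made in the text.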

Another way to illustrate how strongly the transcription cycles are connected
to the cell cycle is to define the following sets of interactions: All, all
interactions, 7053 elements; Ccc, 64 elements; Pccc (for proper Ccc), Ccc
without interactions common to all conditions, 50 elements.

The number of TFs in cycles of interactions for the set All is 27, which is
close to the average value of 31.4 obtained in random tests. Because Ccc and
Pccc are so small, the tests for All − Pccc and All − Ccc should have very
similar average values, but the actual number drops from 27 to 8 and 5
respectively.

Cycles of subnetworks other than cell cycle. For a subnetwork A we define
$\mathrm{Lscc}_A$ as the set of interactions of A that are also in Lscc; to
measure the difference between two sets we use $|A \oplus B|$, the number of
elements that are in one of the sets A, B but not in both.

When we compare the subnetworks of the cell cycle and sporulation, we
observe that $\mathrm{Lscc}_{sp} \subset \mathrm{Lscc}_{cc}$ and
$|\mathrm{Lscc}_{sp} \oplus \mathrm{Lscc}_{cc}| = 12$. Nevertheless, the
cycles of $\mathrm{Lscc}_{sp}$ involve only 7 interactions.

In terms of $|A \oplus B|$, stress response is most distant from the cell cycle:
$|\mathrm{Lscc}_{sr} \oplus \mathrm{Lscc}_{cc}| = 32$, as
$|\mathrm{Lscc}_{cc} - \mathrm{Lscc}_{sr}| = 22$ and
$|\mathrm{Lscc}_{sr} - \mathrm{Lscc}_{cc}| = 10$.
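The distance used here is the size of the symmetric difference, which splits into the two one-sided differences (32 = 22 + 10 in the stress response comparison). A minimal sketch follows; the edge sets are hypothetical placeholders and not data from the paper.

```python
def distance(a, b):
    """Size of the symmetric difference: elements that are in exactly one of the two sets."""
    return len(a ^ b)

# Hypothetical toy edge sets; the node names are placeholders, not TFs from the paper.
lscc_cc = {("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")}
lscc_sr = {("A", "B"), ("D", "E"), ("E", "D")}

only_cc = lscc_cc - lscc_sr   # interactions active only in the first subnetwork
only_sr = lscc_sr - lscc_cc   # interactions active only in the second subnetwork

# The distance always equals the sum of the two one-sided differences.
print(distance(lscc_cc, lscc_sr), len(only_cc), len(only_sr))
```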

Stress response is also special in the sense that it has cycles of its own, all of
which involve TF YAP6 that is not active in any other subnetwork. It seems that
the cyclic interaction of this TF with two other TFs is a differentiating part of
the stress response condition. Two other conditions, diauxic shift and DNA damage,
have similar sets of active interactions in Lscc, but they lack 5 interactions
involving YAP6.

One cycle consists of 3 interactions that are common to all conditions, REB1 →
SIN3 → HSF1 → REB1. Note that HSF1 is a heat shock factor, very important
in the stress response, but also in "basal level sustained transcription" (see
Mager and Ferreira [5]). One possible role of cycles in stress response is
slowing down the recovery transition from the stress condition, so it can last
several hours [5]. During the recovery, sporulation and cell cycle activities
are suppressed. In this sense, stress response is partially endogenous, to use
the classification of Luscombe et al. [4] (they group Cell Cycle and Sporulation
as endogenous and the other conditions as exogenous).

Fig. 1. The diagram of Lscc



Lscc has an orderly layout. Fig. 1 shows the graph formed by the transcription
factors and interactions of Lscc, with nodes placed on a square grid so as to
minimize the edge lengths. Note that rather few edges (7 of 49) are longer than
a single square side/diagonal.

In the diagram, al (apricot color) marks the nodes present in the cycles of 
all subnetworks. The cycles in the diauxic shift and DNA damage subnetworks 
contain only these nodes. (Note that an interaction of Lscc can be active in a 
subnetwork without belonging to a cycle in that subnetwork.) 

The cycles in the sporulation subnetwork sp contain apricot and strawberry 
nodes. 

The cycles in the cell cycle subnetwork cc contain apricot, strawberry and 
cerulean nodes. 
(a) A pictorial proof that {1,3,25} is the unique minimum feedback vertex set.
(b) Three cyclic units of LSCC with connections.

Fig. 2. At least three feedback vertices are needed because there exist three
vertex-disjoint cycles, indicated by wide color strips. If a single vertex selection
on an indicated cycle suffices for the feedback vertex set then it must intersect
every cycle that is vertex disjoint with the other indicated cycles; cycles indicated
with thin color strips show that such selections are unique.



The cycles in the stress response subnetwork sr contain apricot and sienna 
nodes. 

One can note that the graph does not appear random. One feature is that it 
can be laid on a regular grid with few long edges. The second is that functionally 
defined groups of nodes, al, sp, cc, sr and ds are rather well separated from 
each other. 



Lscc has small feedback vertex set and three natural subunits. Another 
property of Lscc is that it has a small and unique minimum feedback vertex set, 
a set of nodes whose removal destroys all cycles. 

The fact that there exists a unique minimum feedback vertex set with three
nodes (vertices) can be clearly seen in Fig. 2(a). Let us call this set
F = {1, 3, 25}.
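For a component as small as Lscc (25 nodes), the existence and uniqueness of a minimum feedback vertex set can also be checked by exhaustive search. The sketch below is an illustration and not the authors' program; the function names and the toy graph are hypothetical.

```python
from itertools import combinations

def has_cycle(nodes, edges):
    """Detect a directed cycle by repeatedly peeling off nodes with no incoming edge."""
    nodes = set(nodes)
    indeg = {u: 0 for u in nodes}
    succ = {u: [] for u in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            succ[u].append(v)
            indeg[v] += 1
    ready = [u for u in nodes if indeg[u] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    # Nodes that could not be peeled off lie on, or downstream of, a cycle.
    return removed < len(nodes)

def minimum_feedback_vertex_sets(nodes, edges):
    """Return every feedback vertex set of the smallest possible size (exhaustive search)."""
    ordered = sorted(nodes)
    for size in range(len(ordered) + 1):
        found = [set(c) for c in combinations(ordered, size)
                 if not has_cycle(set(ordered) - set(c), edges)]
        if found:
            return found
    return []

# Hypothetical toy graph: cycles 1->2->1 and 3->4->5->3, joined by the edge 2->3.
toy_nodes = {1, 2, 3, 4, 5}
toy_edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 5), (5, 3)]
solutions = minimum_feedback_vertex_sets(toy_nodes, toy_edges)
print(len(solutions[0]), len(solutions))
```

In the toy graph the minimum size is 2 but the set is not unique; a result list with a single entry would correspond to the uniqueness claimed for F.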

We can use F to distinguish three natural cyclic units within Lscc, $S_b$ for
each $b \in F$. We can think that b is the "boss" of $S_b$. We define $S_b$ as
the union of all simple cycles that go through b but not through $F - \{b\}$.
Only one node can have two bosses: $\{4\} = S_1 \cap S_{25}$. Because there is
only one path from 1 to 4 and three disjoint paths from 25 to 4, we remove 4
from $S_1$ to make our units disjoint. The three sets coincide well with
functional categories: $S_3 = \{3, 21, 24\}$ are the nodes on cycles of
$\mathrm{Lscc}_{sr}$, $S_1$ are the nodes on cycles of $\mathrm{Lscc}_{sp}$, and
$S_{25}$ are the nodes on cycles of $\mathrm{Lscc}_{cc}$ minus $S_1$ (observe
that $S_1$ is contained in $\mathrm{Lscc}_{cc}$). (Actually, $S_{25}$ has 11
nodes and it has one node that is not in $\mathrm{Lscc}_{cc}$, 18, and one node
of the cell cycle network is missed, 8.)

Thus the cyclic subnetwork has three cyclic parts, plus two acyclic parts:
5 nodes on paths from $S_{25}$ to $S_3$, and 1 node on a path from $S_{25}$ to
$S_1$. We show this schematically in Fig. 2(b). These units are related to
"large network structures" that were observed, but not described, by Lee et
al. [11].



2.2 Statistical profile of the TFs from the Lscc
We tested 1000 random networks generated in three ways: 

1. with the same in- and out-degrees as in the actual network; 

2. the same, but with the number of bi-fans increased to the actual using ran- 
dom swaps of edge ends, and accepting them when they increase the number 
of bi-fans; 

3. similar to the last one, but increasing the number of feed-forward loops. 

These three methods yielded similar results. We will be reporting these results
in the format a (b, c) where a, b, c will be the averages obtained using method 1,
2 and 3 respectively.



Average out-degree in the Lscc. On the average, a transcription factor has 
50 targets, and the average for the transcription factors of the Lscc is 128. This 
agrees with the average 121 (120, 123) for the random model. It can be explained 
by the nodes of large out-degree having much larger chances of joining the cycles. 
In the actual network, of the top 20 most active TFs, the Lscc has 12. 



Average co-regulation in the Lscc and overall. We define the co-regulation
of two TFs as the number of shared target genes. If two TFs have $t_1$ and
$t_2$ targets, while the total number of targets is n, on the average they share
$E = t_1 t_2 / n$ targets. If they actually share A targets, their relative
co-regulation is A/E. Over all pairs of TFs, the average relative co-regulation
is 2.93, and for the pairs in the Lscc the average is 2.0. The explanation is
that the relative co-regulation tends to be high for the TFs with a small number
of targets (when the expected co-regulation is very small). If we increase the
number of bi-fans by random re-wiring, the average co-regulation increases
modestly, because the gains occur mostly for the TFs with a large number of
targets. Therefore the "generating force" of bi-fans is not the random
re-wiring. Mutation by duplication (see Teichmann and Babu [10]) can explain
this pattern: a duplicated pair has large co-regulation even if it has small
out-degree.
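The definition of relative co-regulation can be made concrete with a small sketch. The target sets below are hypothetical and only illustrate why pairs of TFs with few targets tend to obtain large values of A/E.

```python
def relative_coregulation(targets_a, targets_b, n_targets):
    """Observed number of shared targets A divided by the expectation E = t1 * t2 / n."""
    shared = len(targets_a & targets_b)
    expected = len(targets_a) * len(targets_b) / n_targets
    return shared / expected

# Hypothetical example with n = 100 target genes labelled 0..99.
n = 100
tf1 = set(range(0, 20))    # 20 targets
tf2 = set(range(10, 30))   # 20 targets, 10 of them shared with tf1
small1 = {0, 1}            # a TF with only two targets
small2 = {1, 2}            # another small TF sharing one target with small1

large_pair = relative_coregulation(tf1, tf2, n)        # A = 10, E = 4
small_pair = relative_coregulation(small1, small2, n)  # A = 1,  E = 0.04
print(large_pair, small_pair)
```

A single shared target between two small TFs already gives a ratio ten times larger than the heavily overlapping pair of large TFs, which matches the explanation given in the text.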

In the random networks, the relative co-regulation is smaller. When we boost
the number of bi-fans, this should increase co-regulation. However, it is much
easier to increase the number of bi-fans at random for a node with large
out-degree, and as a result, this process actually decreases the average
relative co-regulation (to 0.87 from 1.27). This test does show that the
distribution of bi-fans, relative to the out-degree of the participating nodes,
is very different from that in a random network with the same number of bi-fans.

The paths are computed in the graph of SCC's; in particular, we view LSCC as a
single node. The entry in column i and row j shows the number of nodes u with
these properties: the longest path through node u has i + j edges and the
longest path from u to another node (a TT) has i edges (consequently, the
longest path from another node to u has j edges). Note that the only way a node
may be on a path of length 3 is when it has an edge from the node that
corresponds to Lscc.

Table 1. Classifying TFs and TTs by their positions on the longest paths.

                       Top     Cycle   Complex    Simple Exception
Top (9)                168       373       373       121        12
Cycle (25)                      1132       696       249        21
Complex (65)                               638       259        31
Simple (38)                                          169        57
Exception (2)                                                   14

Table 2. Co-regulation of various hierarchy classes



Position of Lscc in the hierarchy. Only 9 TFs are "upstream" from the
Lscc in the sense that there are paths from these TFs to the Lscc; of these 9
paths 8 are single edges and one consists of two edges. If we consider that path
to be an exception, collectively the cyclic component has an unambiguous
hierarchical position, 2nd from the top. In a random network, on the average we
have 17 (16.8, 14.8) "upstream" TFs. In this sense, the cyclic component is
higher in the hierarchy than the average in the random model.

In the random model we can see that most of the long paths go through the
large cyclic component; in the actual network this is even more so. Every TF
(with two exceptions) which is on a path of length 3 or more either belongs to
the cyclic component, or it can reach the cyclic component, or it can be reached
from it. This means that 38 TFs are on very short paths only and form a rather
separate part of the transcription network, while 104 TFs belong to paths of
length 3 or more. The length of the longest path measured when we collapse the
cyclic components to single nodes is 13 in the actual network, and on the average
9.4 (9.2, 9.4) in the random model.

Yu and Gerstein [12] propose a partition of networks according to the length 
of shortest paths to those TFs that have only TTs as their targets. This definition 
would not work with the length of the shortest paths to TTs: this length is 1 for 
all TFs but ten, and for that ten, it is 2, so the hierarchy would be trivial. 

Because Lscc has such a special and statistically significant position in the
network, we propose to partition TFs by their relation to Lscc, as it is indicated
in Table 1.

Table 2 considers five classes of TFs from Table 1, the fifth class consisting
of two TFs that do not fit into our schema. For each pair of classes we give the
number of TTs that are co-regulated by TFs from those two classes (positions
like Top-Top give the number of TTs regulated by that class alone; the size of
each class is given in parentheses).

We performed our study using the data of Luscombe et al. [4] because we
wanted to compare the cycles with physiological subnetworks described in their
paper. Nevertheless, we compared our definition of a hierarchy with that of Yu
and Gerstein [12], who performed their investigation in a larger transcription
network.

When we apply our program to the latter network, the proportions between 
the class sizes remain similar: Top (20), Cycle (63), Complex (114), Simple (84) 
and Exception (5). 

We performed two tests applied by Yu and Gerstein to their classes. When 
we checked the percentage of essential genes in our classes, we got 15% in Top 
and Cycle, 13% in Complex and 12% in Simple, a more uniform distribution 
than among classes of Yu and Gerstein. A more striking difference exists when 
we check the percentage of cancer related genes: 25% in Top, 16% in Cycle, 12% 
in Simple and below 3% in Complex. 

The division we propose is closely related to the notion proposed by Yu and
Gerstein: a division of transcription control mechanisms into reflex processes
and cogitation processes. Simple clearly corresponds to reflex processes. In a
cogitation process, one that involves a long path of interactions, we can partition
the process into the beginning, the middle and the ending part. As the various
paths have very different lengths, identifying Lscc as the middle is both
"objective" and independent of the path length, and at the same time quite
arbitrary. However, we show in the next subsection that Lscc has a "switchboard"
property even in the physiological conditions in which paths do not form cycles,
and we have just seen that the percentage of cancer-related genes sharply drops
as we move from the middle to the final part of the long paths.



^ The maximum length of a simple path is perhaps a better measure, but it requires 
a much more complex program to compute it. It is closely related to the feedback 
vertex set problem. 




Fig. 3. Parts of Lscc that are active during endogenous conditions (or, conditions
with a larger number of active cycles).




Fig. 4. Parts of Lscc that are active during exogenous conditions (or, conditions
with the fewest active cycles).



2.3 Topological changes inside Lscc 

In Fig. 3 and Fig. 4 we can see the interactions of Lscc that are active in various
physiological conditions. We can observe large differences between the
subnetworks, both in the composition and in topological characteristics like
average path length.

Because so many paths of TFs go through Lscc, the differences between
average path lengths that were observed for different subnetworks by Luscombe
et al. [4] are largely caused by the different presence of these networks in the
Lscc. In Table 3 we use PercentPath to denote the percentage of the shortest
paths from transcription factors to the terminal targets that either originate in
or go through Lscc, and PercentLength to denote the similar percentage for
the sum of lengths of shortest paths.
Table 3. Importance of Lscc in the paths of different subnetworks



Table 3 shows that even in DNA damage and diauxic shift subnetworks the
majority of shortest paths between TFs and TTs go through Lscc; we may say
that Lscc has the role of a switchboard.


3 Conclusions 

We inspected graph-theoretic properties of the cycles in the transcription net- 
work of Saccharomyces cerevisiae. While in general cycles are "avoided" by the 
network, interactions common to all phases of the cell cycle form a big exception, 
and interactions specific to the stress response form a smaller exception. 

In spite of their modest number (they involve 25 of 142 transcription factors 
that were included in the data set), the transcription factors that are included 
in cycles have a large topological impact: most of the shortest paths between 
transcription factors and terminal targets go through them. 

One should compile many kinds of data to establish the exact role of the 
cycles of transcription interactions in controlling life processes. In particular, cell 
cycle, which is closely related to cancer, possesses a long cycle that can be easily 
interrupted at many different points, and the process itself can be interrupted 
by a number of different conditions (like DNA damage). 

We have shown that Lscc is a key part of the regulatory network and that it
can be divided into functional subunits. Further work will yield a fuller and
clearer picture of these subunits and their interactions under various conditions.

4 Methods 
4.1 Data 

We used supplementary materials for [4] (at
http://sandy.topnet.gersteinlab.org/index2.html); we also used supplementary
materials of [12] and the list of yeast homologs of human cancer genes
personally communicated by Haiyuan Yu.



4.2 Graph-theoretic definitions 

A graph of a network consists of nodes (which correspond to TFs, transcription 
factors and TTs, terminal targets) and directed edges/interactions. 

A path in a graph is a sequence of nodes $(u_0, \ldots, u_{k-1})$ such that each
consecutive pair $(u_{i-1}, u_i)$ is an edge. If additionally there exists an
edge $(u_{k-1}, u_0)$ we say that this is a cycle.

A single node (u) forms a degenerate cycle. 

Nodes in a graph are partitioned into strongly connected components, or
SCC's. A node u is contained in SCC(u) which is the union of the node sets
of all cycles that contain u.

SCC's with one node are called trivial. 

For graph G we define graph $G_{SCC}$, the graph of SCC's of G. Nodes of
$G_{SCC}$ are SCC's of G, and edges are pairs of the form (SCC(u), SCC(v)) such
that (u, v) is an edge of G.

$G_{SCC}$ cannot have cycles of its own, and therefore it is easy to compute
longest paths in that graph (the algorithm is considered folklore). The path
lengths in that graph are used in Table 1.
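The folklore longest-path computation on an acyclic graph can be sketched as a memoized recursion. This is an illustration under the assumption that the graph of SCC's is already available as a dictionary of successor lists; the node names in the toy example are hypothetical. Longest paths into a node are obtained by running the same routine on the reversed graph, which gives the two quantities i and j used in Table 1.

```python
from functools import lru_cache

def longest_paths_from(dag):
    """Map each node of an acyclic graph to the number of edges on the longest path starting there."""
    @lru_cache(maxsize=None)
    def depth(u):
        # A node without successors ends every path, so its depth is 0.
        return max((1 + depth(v) for v in dag.get(u, ())), default=0)

    nodes = set(dag) | {v for successors in dag.values() for v in successors}
    return {u: depth(u) for u in nodes}

def reverse(dag):
    """Return the graph with every edge turned around."""
    reversed_dag = {}
    for u, successors in dag.items():
        for v in successors:
            reversed_dag.setdefault(v, []).append(u)
    return reversed_dag

# Hypothetical graph of SCC's in which the large component is one node, "LSCC".
toy = {"top": ["LSCC"], "LSCC": ["mid", "tt1"], "mid": ["tt2"]}
i_values = longest_paths_from(toy)           # longest path from each node
j_values = longest_paths_from(reverse(toy))  # longest path into each node
print(i_values["LSCC"], j_values["LSCC"])
```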

We use LSCC to denote the largest strongly connected component in a graph. 
We apply this definition when the majority of elements of non-trivial SCC's 
belongs to one of them, so there is no ambiguity as to which one is "the largest" . 

4.3 Algorithms 

To compute non-trivial SCC's we first obtained a "dictionary" protein code <->
number followed by pairs of numbers representing the edges. We computed SCC's
and the graph of SCC's using the algorithm of Tarjan [9]. His method is usually
described in textbooks of algorithms for biconnected components (the difference
between the two algorithms is contained in one line of code). Shortest paths
used in subsection 2.3 were computed using breadth first search.
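A compact recursive version of Tarjan's algorithm is sketched below for illustration. It is not the authors' awk or C code, and the toy graph is hypothetical; the recursion is adequate for graphs of the size discussed here.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to an iterable of its successors."""
    index = {}        # discovery order of each visited node
    low = {}          # smallest discovery index reachable from the node's subtree
    stack = []
    on_stack = set()
    components = []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in graph.get(u, ()):
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:
            # u is the root of a component: pop the stack down to u.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == u:
                    break
            components.append(component)

    nodes = set(graph) | {v for successors in graph.values() for v in successors}
    for u in sorted(nodes):
        if u not in index:
            visit(u)
    return components

# Hypothetical toy network with one component of three nodes and one of two.
toy = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["E"], "E": ["D", "F"]}
components = strongly_connected_components(toy)
non_trivial = [c for c in components if len(c) > 1]
lscc = max(non_trivial, key=len)
print(sorted(lscc))
```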

We implemented this algorithm in two programming languages: in awk, which
makes it very easy to compare the result with various text files, and in C, which
allows us to perform thousands of statistical tests very quickly.

4.4 Defining motifs, generating random graphs 

We define a feed-forward loop (ffl for short) as a triple of nodes
$(u_0, u_1, u_2)$ such that there exist three edges: two form a path
$(u_0, u_1, u_2)$ while the third forms a shortcut, $(u_0, u_2)$. A bi-fan is a
quadruple of nodes $(u_0, u_1, v_0, v_1)$ such that all of the 4 possible edges
of the form $(u_i, v_j)$ exist.

When we count ffl's and bi-fans we remove the self-loops (edges of the form 
(u,u)) from the graph. Moreover, every triple/quadruple is counted separately, 
even when they share nodes. 

To count ffl's and bi-fans we made a table Overlap that for a pair of TFs
stored the number of common targets. For every positive entry k = Overlap(a, b)
we add $k(k-1)/2$ to the count of bi-fans, and if there is an edge from a to b,
we add Overlap(a, b) to the count of ffl's.
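The counting rule can be sketched as follows. This is an illustrative reading of the description, with hypothetical names; each unordered pair of regulators is visited once, so the edge test for ffl's is applied in both directions.

```python
from itertools import combinations

def count_motifs(edges):
    """Count bi-fans and feed-forward loops from pairwise target overlaps, ignoring self-loops."""
    targets = {}
    for u, v in edges:
        if u != v:                      # self-loops are removed before counting
            targets.setdefault(u, set()).add(v)
    bifans = 0
    ffls = 0
    for a, b in combinations(sorted(targets), 2):
        k = len(targets[a] & targets[b])   # Overlap(a, b)
        bifans += k * (k - 1) // 2         # every pair of common targets closes a bi-fan
        if b in targets[a]:                # edge a -> b: each common target closes an ffl
            ffls += k
        if a in targets[b]:                # edge b -> a: same rule in the other direction
            ffls += k
    return bifans, ffls

# Hypothetical toy network: X and Y share targets T1 and T2, and X regulates Y.
toy_edges = [("X", "Y"), ("X", "T1"), ("X", "T2"), ("Y", "T1"), ("Y", "T2"), ("X", "X")]
print(count_motifs(toy_edges))
```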



The method of generating random graphs was adapted from Milo et al. [6].
For a node u with in-degree a and out-degree b we conceptually make a
"in-stubs" and b "out-stubs"; in actuality, we have an array in which b positions
are reserved for the adjacency list of u, and u is in a locations of that array.
Then we pick a random permutation of the array content. Subsequently, we sort
the adjacency lists. Finally, we scan the adjacency lists and we correct "errors",
which are of two types: a list of some node v contains u for the second time,
or it contains v itself. We try to exchange the offending array position with a
randomly chosen other position until the exchange does not introduce an error
of its own.
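A simplified sketch of this stub-matching procedure with repair is given below. It keeps the in- and out-degrees fixed by permuting edge targets, then repairs self-loops and duplicate edges by random exchanges. The data layout, the function name and the max_tries bound are assumptions made for illustration; the authors' array-based implementation differs in detail, and like theirs this variant is not guaranteed to sample uniformly.

```python
import random
from collections import Counter

def rewire(edges, rng, max_tries=10_000):
    """Random digraph with the same in- and out-degrees as `edges`, without self-loops or duplicates."""
    sources = [u for u, _ in edges]
    targets = [v for _, v in edges]
    rng.shuffle(targets)                       # random matching of out-stubs to in-stubs
    count = Counter(zip(sources, targets))     # multiplicity of every current edge

    def bad(i):
        return sources[i] == targets[i] or count[(sources[i], targets[i])] > 1

    def swap(i, j):
        count[(sources[i], targets[i])] -= 1
        count[(sources[j], targets[j])] -= 1
        targets[i], targets[j] = targets[j], targets[i]
        count[(sources[i], targets[i])] += 1
        count[(sources[j], targets[j])] += 1

    for i in range(len(sources)):
        tries = 0
        while bad(i):
            if tries == max_tries:
                raise RuntimeError("could not repair the random matching")
            tries += 1
            j = rng.randrange(len(sources))
            if j == i:
                continue
            j_was_bad = bad(j)
            swap(i, j)
            # Keep the exchange only if it fixes position i without breaking position j.
            if bad(i) or (bad(j) and not j_was_bad):
                swap(i, j)                     # undo the exchange
    return list(zip(sources, targets))

# Hypothetical sparse reference network on 20 nodes, each with in- and out-degree 2.
base = [(i, (i + 1) % 20) for i in range(20)] + [(i, (i + 5) % 20) for i in range(20)]
randomized = rewire(base, random.Random(0))
print(len(randomized))
```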

A similar approach is used to "boost" the number of bi-fans or ffl's: a random
swap of two array positions is accepted if it does not introduce an error and it
increases the respective count (ffl's to 997 or bi-fans to 61,034).